Description: the implementation of "The Zebra System"
Version: 1.2.0.20210721
Group name: YYDS
Authors: Haodong Liu and Jichen Zhao
Airbnb has become a popular platform among holidaymakers and tourists for lodging and rental houses. A host can manage his/her listings, and a guest can select one to fulfill his/her unique and personalised travelling plans. A public Airbnb dataset is explored for the visualisation tasks. It covers summary information and metrics for listings in New York City (NYC), New York, USA for 2019. The data table is stored in the CSV file Airbnb_NYC_2019.csv, which was downloaded from the corresponding dataset info page on Kaggle.
Two information visualisation "systems" have been implemented: "The Giraffe System" (hereinafter called Giraffe) and "The Zebra System" (hereinafter called Zebra). This is because the visualisation tasks are defined in the same context but with different contents. For example, both Giraffe and Zebra explore a task to consume information by analysing the data, but the specifications vary. In any case, we would expect both of them to provide general insights into the NYC listings for 2019, since the visualisation tasks should help visualise and understand the primary data features and correlations.
NOTE: Please ensure that no exception is thrown in this section before executing the other sections.
import json
import altair as alt
import pandas as pd
data = pd.read_csv('Airbnb_NYC_2019.csv') # Load raw data from the data file.
print('The number of listings:', len(data))
The number of listings: 48895
The dataset contains too many records, and we do not want to bypass Altair's MaxRows check. Hence, we randomly select 5000 listings as the items under investigation for demonstration purposes. Missing values are then examined for further data processing.
data = data.sample(n = 5000, random_state = 0)
data.isnull().sum() # List the number of null values for each column.
id                                   0
name                                 2
host_id                              0
host_name                            5
neighbourhood_group                  0
neighbourhood                        0
latitude                             0
longitude                            0
room_type                            0
price                                0
minimum_nights                       0
number_of_reviews                    0
last_review                       1036
reviews_per_month                 1036
calculated_host_listings_count       0
availability_365                     0
dtype: int64
Each item (i.e., a listing) originally has 16 attributes as follows. We would keep relevant attributes for visualisation tasks.
| Attribute | Description | Kept |
|---|---|---|
| id | The listing ID | √ |
| name | The listing name | |
| host_id | The host ID | √ |
| host_name | The host name | |
| neighbourhood_group | One of the 5 boroughs in NYC | √ |
| neighbourhood | One of the neighbourhoods in NYC | √ |
| latitude | The latitude coordinate | √ |
| longitude | The longitude coordinate | √ |
| room_type | One of the room types defined by Airbnb | √ |
| price | The price in US dollars for a night stay | √ |
| minimum_nights | The minimum number of nights that a guest can book | |
| number_of_reviews | The number of reviews | |
| last_review | The date of the latest review | |
| reviews_per_month | The number of reviews per month | √ |
| calculated_host_listings_count | The number of different listings for a particular host | √ |
| availability_365 | The number of days for which a particular listing is available in a year | |
NOTE:
- name and host_name would be removed. We already have unique IDs for listings and hosts, and we are not interested in their names. Hence, they would be dropped, which also avoids any potential ethical issue.
- neighbourhood_group, neighbourhood, and room_type are categorical. This attribute type could be vital for information visualisation.
- minimum_nights and availability_365 would be removed. These attributes could be significantly subject to host preferences, and we are not interested in such future data.
- number_of_reviews would be removed. The listings could have been added at different times, and we reckon that the attribute reviews_per_month would be more meaningful. It contains missing values because a particular listing could have no review; in this case, we could simply fill these values with 0.
- last_review would be removed. We would focus on the generic trend, distribution, etc. This attribute could contribute little to the visualisation tasks, since its value could be null and we do not have another clear date for comparison.

data.drop(
['name', 'host_name', 'minimum_nights', 'number_of_reviews', 'last_review', 'availability_365'],
axis = 1,
inplace = True)
data.fillna({'reviews_per_month': 0}, inplace = True)
data
| | id | host_id | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | reviews_per_month | calculated_host_listings_count |
|---|---|---|---|---|---|---|---|---|---|---|
| 43813 | 33893655 | 138798990 | Manhattan | Tribeca | 40.72430 | -74.01110 | Entire home/apt | 225 | 0.00 | 1 |
| 32734 | 25798461 | 195803 | Manhattan | NoHo | 40.72555 | -73.99283 | Entire home/apt | 649 | 0.40 | 1 |
| 25276 | 20213045 | 2678122 | Brooklyn | Williamsburg | 40.71687 | -73.95012 | Entire home/apt | 300 | 0.35 | 3 |
| 36084 | 28670432 | 115993835 | Brooklyn | Sunset Park | 40.64036 | -74.00822 | Private room | 26 | 1.36 | 5 |
| 17736 | 13920697 | 29513490 | Brooklyn | Bedford-Stuyvesant | 40.68370 | -73.93325 | Entire home/apt | 125 | 0.12 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 28768 | 22226156 | 14103679 | Manhattan | Murray Hill | 40.74623 | -73.97404 | Private room | 199 | 4.04 | 1 |
| 30348 | 23463017 | 9241743 | Manhattan | Harlem | 40.80899 | -73.94225 | Private room | 80 | 0.06 | 1 |
| 3812 | 2297192 | 11732741 | Manhattan | Financial District | 40.70825 | -74.00495 | Entire home/apt | 1000 | 0.00 | 1 |
| 38366 | 30212188 | 224414117 | Manhattan | Hell's Kitchen | 40.75523 | -73.99827 | Private room | 107 | 3.52 | 30 |
| 40157 | 31151589 | 16354583 | Brooklyn | Crown Heights | 40.67325 | -73.94666 | Private room | 30 | 0.33 | 1 |
5000 rows × 10 columns
Let us first define some common variables.
borough_label = 'Borough'
host_label = 'Host'
init_max_price = 500
listing_count_label = 'The number of listings'
max_price = data['price'].max()
min_price = data['price'].min()
neighbourhood_label = 'Neighbourhood'
per_cent_label = 'Per cent'
price_label = 'Price'
reviews_label = 'The number of reviews per month'
room_type_label = 'Room type'
select_all_label = 'All'
price_selection = alt.selection_single(
bind = alt.binding_range(
max = max_price,
min = min_price,
name = 'Max price: ',
step = 1
),
fields = ['max_price_filter'],
init = {'max_price_filter': init_max_price},
name = 'price_selection'
) # A price filter.
Before visualisation, it is necessary to understand the interactive nature of charts created using Altair. In plain English, it is essential to take advantage of the following features.
A multi selection is similar to a single selection, but it allows multiple chart objects to be selected at once. By default, chart elements can be added to and removed from the selection by clicking on them while holding the Shift key.
The 7 visualisation tasks are defined as follows. Zebra shares almost the same sections as Giraffe up to this point because they are necessary preparations. However, the following sections may vary considerably from those of Giraffe, since we perform the same visualisation tasks using different design decisions.
| Task | Action | Specification |
|---|---|---|
| #1 | Analyse and consume | Discover the number of listings by borough and room type to find a borough with the most listings and entire rooms/apartments. |
| #2 | Analyse and produce | Derive the per cent of room type by borough to compare between the 2 categories. |
| #3 | Search | Look up the number of Manhattan's neighbourhoods in the top 10 neighbourhoods by the number of listings. |
| #4 | Search | Browse the host ranking by the number of reviews per month and the number of listings to find the host ranking first in each case. |
| #5 | Search | Locate the most popular price range for each borough/room type. |
| #6 | Search | Explore any noticeable pattern in the price distribution by room type. |
| #7 | Query | Identify, compare, and summarise the correlations among prices, locations, the number of listings, boroughs, and room types. |
People might be interested in questions like "who has the most...?" when it comes to comparisons. Bar charts would be a good choice. However, if there are multiple categories for grouping, we had better consider whether the bars should be stacked or grouped.
NOTE:
# Plot the grouped bar charts.
legend_selection = alt.selection_multi(bind = 'legend', fields = ['room_type'])
base = alt.Chart(data).mark_bar().encode(
color = alt.Color('room_type:N', legend = alt.Legend(title = room_type_label)),
column = alt.Column(
'neighbourhood_group:N',
header = alt.Header(
labelOrient = 'bottom',
title = borough_label,
titleOrient = 'bottom'),
sort = data['neighbourhood_group'].value_counts().index.tolist(), # Keep the same borough order as the stacked bar chart.
title = None
),
opacity = alt.condition(legend_selection, alt.value(1), alt.value(0.2))
).properties(width = alt.Step(40)).add_selection(legend_selection)
subplot_left = base.encode(
x = alt.X(
'room_type:N',
axis = None,
sort = '-y'
),
y = alt.Y('count():Q', axis = alt.Axis(title = listing_count_label)),
tooltip = 'count():Q'
).properties(title = listing_count_label + ' by ' + borough_label.lower()) # Visualise the number of listings by borough.
subplot_right = base.transform_aggregate(count = 'count():Q', groupby = ['neighbourhood_group', 'room_type']).transform_joinaggregate(
total = 'sum(count):Q', groupby = ['neighbourhood_group']
).transform_calculate(
per_cent = alt.datum.count / alt.datum.total
).encode(
x = alt.X(
'room_type:N',
axis = None,
sort = 'ascending'
),
y = alt.Y('per_cent:Q', axis = alt.Axis(format = '%', title = per_cent_label + ' of ' + room_type_label.lower())),
tooltip = alt.Tooltip('per_cent:Q', format = '.2%')
).properties(title = per_cent_label + ' of ' + room_type_label.lower() + 's by ' + borough_label.lower()) # Visualise the per cent of room types by borough.
(subplot_left | subplot_right).configure_title(anchor = 'middle').configure_view(stroke = None)
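The aggregation behind these subplots, namely counts per (borough, room type) pair and row-normalised shares, can be sketched in pandas on toy data (the values below are illustrative, not drawn from the real dataset):

```python
import pandas as pd

# Toy listings (illustrative borough/room-type values, not the real dataset).
toy = pd.DataFrame({
    'neighbourhood_group': ['Manhattan', 'Manhattan', 'Brooklyn',
                            'Brooklyn', 'Brooklyn', 'Queens'],
    'room_type': ['Entire home/apt', 'Private room', 'Entire home/apt',
                  'Private room', 'Private room', 'Shared room']
})

# Counts per (borough, room type) pair: what each bar in the left subplot encodes.
counts = pd.crosstab(toy['neighbourhood_group'], toy['room_type'])

# Row-normalised shares: what the per cent subplot on the right encodes.
shares = pd.crosstab(toy['neighbourhood_group'], toy['room_type'], normalize = 'index')
```

The transform_aggregate/transform_joinaggregate/transform_calculate pipeline in the chart spec computes the same ratios inside Vega-Lite rather than in pandas.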
Sometimes bar charts are used for visualising a ranking. We would not say that one form surpasses the other, but which one might be more suitable for a specific scenario?
NOTE:
print('The number of unique neighbourhoods:', data['neighbourhood'].nunique())
The number of unique neighbourhoods: 184
neighbourhood_top10 = data['neighbourhood'].value_counts().head(10).index # Get the top 10 neighbourhoods by the number of listings.
data_neighbourhood_top10 = data.loc[data['neighbourhood'].isin(neighbourhood_top10)] # Keep data of the top 10 neighbourhoods.
# Plot the vertical bar chart.
legend_selection = alt.selection_multi(bind = 'legend', fields = ['neighbourhood_group'])
alt.Chart(
data_neighbourhood_top10,
title = 'Top 10 ' + neighbourhood_label.lower() + 's by ' + listing_count_label.lower()
).mark_bar().encode(
x = alt.X(
'neighbourhood:N',
axis = alt.Axis(title = neighbourhood_label),
sort = '-y'
),
y = alt.Y('count():Q', axis = alt.Axis(title = listing_count_label)),
color = alt.Color('neighbourhood_group:N', legend = alt.Legend(title = borough_label)),
opacity = alt.condition(legend_selection, alt.value(1), alt.value(0.2)),
tooltip = 'count():Q'
).properties(width = alt.Step(40)).add_selection(legend_selection)
Still on ranking bar charts: such a chart usually consists of a categorical attribute and a quantitative attribute by which the bars can be ordered. Is it always good practice to visualise the data in a specific order?
NOTE:
print('The number of unique hosts:', data['host_id'].nunique())
The number of unique hosts: 4630
# Get the top 10 hosts by the number of reviews per month.
data_host_reviews_top10 = data.groupby('host_id', as_index = False)['reviews_per_month'].sum() # Sum the monthly reviews over each host's listings.
data_host_reviews_top10 = data_host_reviews_top10.nlargest(10, 'reviews_per_month')
# Get the top 10 hosts by the number of listings.
host_top10 = data['host_id'].value_counts().head(10).index
# Keep the specific columns of the data of the top 10 hosts by the number of listings.
data_host_listings_top10 = data.loc[data['host_id'].isin(host_top10)]
data_host_listings_top10 = data_host_listings_top10[['host_id', 'calculated_host_listings_count']].drop_duplicates()
# Plot the unordered bar charts.
base = alt.Chart().mark_bar().encode(
y = alt.Y('host_id:N', axis = alt.Axis(title = host_label + ' ID'))
).properties(height = alt.Step(40))
subplot_left = base.encode(
x = alt.X('reviews_per_month:Q', axis = alt.Axis(title = reviews_label)),
tooltip = 'reviews_per_month:Q'
).properties(
data = data_host_reviews_top10,
title = 'Top 10 ' + host_label.lower() + 's by ' + reviews_label.lower()
) # Visualise the top 10 hosts by the number of reviews per month.
subplot_right = base.encode(
x = alt.X('calculated_host_listings_count:Q', axis = alt.Axis(title = listing_count_label)),
tooltip = 'calculated_host_listings_count:Q'
).properties(
data = data_host_listings_top10,
title = 'Top 10 ' + host_label.lower() + 's by ' + listing_count_label.lower()
) # Visualise the top 10 hosts by the number of listings.
subplot_left | subplot_right
Line charts might be preferred when we try to visualise a relationship or trend, but we should admit that histograms can be versatile. Why not just try and compare them?
NOTE:
# Plot the line charts.
legend_selection_left = alt.selection_multi(bind = 'legend', fields = ['room_type'])
legend_selection_right = alt.selection_multi(bind = 'legend', fields = ['neighbourhood_group'])
base = alt.Chart(data).transform_filter(alt.datum['price'] <= price_selection['max_price_filter']).encode(
x = alt.X(
'price:Q',
axis = alt.Axis(title = price_label + ' (binned)'),
bin = alt.Bin(step = 20)
),
y = alt.Y('count():Q', axis = alt.Axis(title = listing_count_label))
).add_selection(price_selection, alt.selection_interval(bind = 'scales'))
subplot_left = base.mark_line(interpolate = 'basis').encode(
color = alt.Color(
'room_type:N',
legend = alt.Legend(symbolStrokeWidth = 5,
title = room_type_label)
),
opacity = alt.condition(legend_selection_left, alt.value(1), alt.value(0.2))
).properties(title = 'By ' + room_type_label.lower()).add_selection(legend_selection_left) # Visualise the relationship between the number of listings and prices, by room type.
subplot_right = base.mark_line(interpolate = 'basis').encode(
color = alt.Color(
'neighbourhood_group:N',
legend = alt.Legend(symbolStrokeWidth = 5, title = borough_label),
scale = alt.Scale(scheme = 'dark2')
),
opacity = alt.condition(legend_selection_right, alt.value(1), alt.value(0.2))
).properties(title = 'By ' + borough_label.lower()).add_selection(legend_selection_right) # Visualise the relationship between the number of listings and prices, by borough.
(subplot_left | subplot_right).properties(
title = 'The relationship between ' + listing_count_label.lower() + ' and ' + price_label.lower() + 's'
).resolve_scale(color = 'independent').configure_title(anchor = 'middle')
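The binned counts that these line charts (and an equivalent histogram) encode can be reproduced with pd.cut; a small sketch on toy prices, using the same 20-dollar step as alt.Bin(step = 20):

```python
import pandas as pd

# Toy nightly prices (illustrative only).
prices = pd.Series([15, 25, 30, 45, 55, 60, 75, 80, 95, 110])

# Bin edges every 20 dollars, mirroring alt.Bin(step = 20).
step = 20
edges = list(range(0, int(prices.max()) + step, step))
binned = pd.cut(prices, bins = edges, right = False)

# The number of listings per price bin: the y value of each plotted point.
bin_counts = binned.value_counts().sort_index()
```

Whether these per-bin counts are then drawn as bars or connected as a (smoothed) line is purely a mark choice.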
Both plots can provide insights into the distribution of a quantitative attribute, and violin plots can also show the density. This does not mean that violin plots are better, but in the context of distribution, which one would be preferred?
NOTE:
# Gain some preliminary findings about the extreme values and the distribution.
room_types = data['room_type'].unique()
prices_room_type_stats = pd.DataFrame()
for prices_room_type in [data.loc[data['room_type'] == room_type] for room_type in room_types]:
prices_room_type_stats = pd.concat([prices_room_type_stats, prices_room_type.describe()['price']], axis = 1)
prices_room_type_stats.columns = room_types
prices_room_type_stats
| Entire home/apt | Private room | Shared room | |
|---|---|---|---|
| count | 2595.000000 | 2290.000000 | 115.000000 |
| mean | 212.536416 | 87.529258 | 71.817391 |
| std | 324.402129 | 93.504706 | 85.790502 |
| min | 10.000000 | 0.000000 | 20.000000 |
| 25% | 120.000000 | 50.000000 | 30.000000 |
| 50% | 160.000000 | 70.000000 | 47.000000 |
| 75% | 229.500000 | 95.000000 | 79.000000 |
| max | 10000.000000 | 2000.000000 | 725.000000 |
# Plot the violin plot.
base = alt.Chart(data).properties(width = 100)
violin = base.transform_density(
'price',
as_ = ['price', 'density'],
extent = [min_price, init_max_price],
groupby = ['room_type']
).mark_area(orient = 'horizontal').encode(
x = alt.X(
'density:Q',
axis = alt.Axis(
grid = False,
labels = False,
ticks = True,
values = [0]),
impute = None,
scale = alt.Scale(nice = False, zero = False),
stack = 'center',
title = None),
y = alt.Y('price:Q', axis = alt.Axis(title = price_label)),
color = alt.Color('room_type:N', legend = alt.Legend(title = room_type_label))
) # Plot the violin part.
box = base.transform_filter(
(alt.datum['price'] >= min_price) & (alt.datum['price'] <= init_max_price)
).mark_boxplot(
color = 'black',
outliers = False,
size = 5
).encode(y = alt.Y('price:Q', axis = alt.Axis(title = price_label))) # Plot the box part.
(violin + box).facet(
column = alt.Column(
'room_type:N',
header = alt.Header(
labelOrient = 'bottom',
labelPadding = 0,
title = room_type_label,
titleOrient = 'bottom'
)
),
spacing = 0
).properties(
title = 'Primary density and distribution of ' + price_label.lower() + 's by ' + room_type_label.lower()
).resolve_scale(x = 'independent').configure_title(anchor = 'middle').configure_view(stroke = None)
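The box part of each violin encodes the quartiles seen in the summary table above; a pure-Python sketch (toy prices, illustrative only) of those statistics:

```python
import statistics

# Toy nightly prices (illustrative only).
prices = [30, 50, 70, 90, 120, 160, 200, 300]

# statistics.quantiles with n = 4 returns the quartiles Q1, Q2 (median), Q3;
# Q1 and Q3 are the bottom and top of the box, Q2 is the middle line.
q1, q2, q3 = statistics.quantiles(prices, n = 4)

# The interquartile range, which sets the height of the box.
iqr = q3 - q1
```

Note that mark_boxplot computes these quantities itself inside Vega-Lite; the sketch only shows what they mean.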
It is incredibly convenient to generate a heatmap based on geo-location for this dataset thanks to the latitude and longitude attributes. Selecting a suitable colour scheme is vital for successful visualisation. We reckon that it is better to use saturation of a single hue. However, we would like to pretend to forget this and perform the specific visualisation task. XD
You live, and you learn.
NOTE:
# Plot the map part using saturation of the same hue.
nyc_geojson = open('NYC.geojson')
boroughs = data['neighbourhood_group'].unique().tolist()
borough_selection = alt.selection_single(
bind = alt.binding_select(
labels = [select_all_label] + boroughs,
name = 'Borough: ',
options = [None] + boroughs),
fields = ['properties.\\boro_name'],
init = {'properties.\\boro_name': select_all_label},
name = 'borough_selection'
)
subplot_left_base = alt.Chart(alt.Data(values = json.load(nyc_geojson)['features'])).mark_geoshape(stroke = 'white').encode(
color = alt.condition(borough_selection, alt.value('#e4e4e4'), alt.value('#f4f4f4')),
tooltip = 'properties.boro_name:N'
).properties(height = 500, width = 600).add_selection(borough_selection) # Plot the map.
nyc_geojson.close()
base = alt.Chart(data).transform_filter(
r"datum['price'] <= price_selection['max_price_filter'] & (borough_selection['properties\\.boro_name'] == null | datum['neighbourhood_group'] == borough_selection['properties\\.boro_name'])"
).add_selection(price_selection)
loc_listing = base.mark_circle(size = 15).encode(
latitude = 'latitude:Q',
longitude = 'longitude:Q',
color = alt.Color(
'price:Q',
legend = alt.Legend(orient = 'none', title = price_label),
scale = alt.Scale(scheme = 'lightorange')
),
tooltip = [alt.Tooltip('room_type:N', title = room_type_label), alt.Tooltip('price:Q', title = price_label)]
).properties(title = price_label + ' distribution by location') # Plot the price points to generate a heatmap.
subplot_top_right = base.mark_bar().encode(
x = alt.X('count():Q', axis = alt.Axis(title = listing_count_label)),
y = alt.Y('room_type:N', axis = alt.Axis(title = room_type_label), sort = '-x'),
tooltip = 'count():Q'
).properties(
height = alt.Step(40),
title = room_type_label + 's by ' + listing_count_label.lower()
).add_selection(borough_selection) # Visualise the room types by the number of listings under current filters.
subplot_bottom_right = base.mark_bar().encode(
x = alt.X(
'price:Q',
axis = alt.Axis(title = price_label + ' (binned)'),
bin = alt.Bin(step = 20)),
y = alt.Y('count():Q', axis = alt.Axis(title = listing_count_label)),
tooltip = 'count():Q'
).properties(
title = listing_count_label + ' by ' + price_label.lower()
).add_selection(borough_selection, alt.selection_interval(bind = 'scales')) # Visualise the number of listings by price under current filters.
((subplot_left_base + loc_listing) | (subplot_top_right & subplot_bottom_right)).resolve_scale(color = 'independent').configure_view(strokeWidth = 0)
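The GeoJSON handling above is plain JSON work; as a standalone sketch (with a minimal inline stand-in for NYC.geojson, which really contains all 5 boroughs), the features list and the properties.boro_name field that the map encodes can be accessed like this:

```python
import json

# A minimal inline stand-in for NYC.geojson (one feature; the real file has 5 boroughs).
geojson_text = json.dumps({
    'type': 'FeatureCollection',
    'features': [{
        'type': 'Feature',
        'properties': {'boro_name': 'Manhattan'},
        'geometry': {
            'type': 'Polygon',
            'coordinates': [[[-74.02, 40.70], [-73.93, 40.70],
                             [-73.93, 40.88], [-74.02, 40.70]]]
        }
    }]
})

# What json.load(nyc_geojson)['features'] yields for the real file.
features = json.loads(geojson_text)['features']

# The field referenced as 'properties.boro_name' in the tooltip and selection.
boro_names = [feature['properties']['boro_name'] for feature in features]
```

Passing the features list to alt.Data(values = ...) is what lets mark_geoshape draw each borough polygon and expose its properties to the selection.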